07 November, 2019
We know \(f_X\), and we need to study \(T(\mathbf{X})\).
KNOWN DISTRIBUTION: \(f_X\)
DATA: \(\mathbf{X} = \{ X_1, \ldots, X_n \} \overset{\text{iid}}{\sim} f_X\)
STATISTIC: \(T(\mathbf{X})\)
13.30 - 14.15 : Lecture & Live coding
14.30 - 15.15 : Exercises part 1
15.30 - 16.15 : Lecture ctd. & Live coding
16.30 - 17.15 : Exercises part 2
Background Literature: Rizzo Chapter 8 (8.1 and 8.2)
Goal:
Obtain a good idea of the sampling distribution of a statistic under a certain hypothesis (e.g. the null hypothesis)
Monte Carlo studies:
We know the true distribution of each data point under a certain hypothesis, thus we can simulate the sampling distribution of the statistic by a Monte Carlo sampling distribution.
Goal:
Obtain a good idea of the sampling distribution of a statistic under a certain hypothesis (e.g. the null hypothesis)
Permutation context:
Given our hypothesis, we do not assume a known distribution for each data point. However, we can still simulate the sampling distribution for certain hypotheses. We can set up a permutation sampling distribution: a sampling distribution that is conditional on your data.
Refreshing permutations and combinations
Notion of a hypothesis, represented by a sampling distribution
Notion of exchangeability
Code your own Permutation Tests
“The lady in question (Dr. Muriel Bristol) claimed to be able to tell whether the tea or the milk was added first to a cup. Fisher proposed to give her eight cups, four of each variety, in random order. One could then ask what the probability was for her getting the specific number of cups she identified correct, but just by chance.”
This is our sampling distribution of success counts under the null hypothesis
| Success count | “milk first” correct | Number of permutations (out of 70) |
|---|---|---|
| 0 | oooo | 1 |
| 1 | xooo oxoo ooxo ooox | 16 |
| 2 | xxoo xoxo xoox oxxo oxox ooxx | 36 |
| 3 | oxxx xoxx xxox xxxo | 16 |
| 4 | xxxx | 1 |
x = correct, o = incorrect
… applied to the lady tasting tea example.
\(m + n:\) the 8 cups in total
\(m:\) 4 “milk first” cups
\(n:\) 4 “tea first” cups
\(k:\) 4 cups the Lady recognizes as “milk first”
\(x:\) successfully assigned “milk first”
\[P(x) = \frac{{m \choose x} {n \choose k - x}}{ m + n \choose k}\]
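As a quick check (not on the slides), the formula can be evaluated directly in base R and reproduces the permutation counts in the table above; it is exactly the hypergeometric probability mass function, so `dhyper` gives the same values:

```r
# Exact null distribution of the success count via the hypergeometric formula
m <- 4; n <- 4; k <- 4
P <- function(x) choose(m, x) * choose(n, k - x) / choose(m + n, k)
probs <- sapply(0:4, P)
round(70 * probs)   # 1 16 36 16 1 -- the counts out of the 70 assignments
dhyper(0:4, m, n, k)  # same probabilities from base R
```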
A Monte Carlo simulated sampling distribution of the number of successfully assigned “milk first” cups:
B <- 7e5
Sampdistr_mc <- table(stats::rhyper(B, m = 4, n = 4, k = 4))
round(Sampdistr_mc / 1e4)
## 
##  0  1  2  3  4 
##  1 16 36 16  1
Sir Ronald Fisher’s experiment on what he poured first:
set.seed(1234)
Truth <- sample(rep(c("Milk", "Tea"), each = 4))
Suppose Lady Muriel Bristol guessed all of them correctly; then our data is
X <- Truth
\(T(\mathbf{X})\): The number of correct “Milk first” cups
CountSuccesses <- function(X, truth) {
sum("Milk" == X & "Milk" == truth) # "Milk" is recycled to the length of X
}
t_obs <- CountSuccesses(X = X, truth = Truth)
Suppose we do not assume a known distribution for each data point or our statistic.
Can we still simulate a good estimate of the sampling distribution under the null hypothesis?
Yes! We can set up a Permutation sampling distribution: a sampling distribution that is conditional on your data.
There are
factorial(length(X))
## [1] 40320
possible permutations we can make of Dr. Muriel Bristol’s choices: permuted data!
Obtain all \(B = 40320\) possible samples, such that we can create the complete permutation distribution of \(T\)
X_bs <- ExtractAllPermutations(X)  # each possible data set in a column
dim(X_bs)
## [1] 8 40320
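`ExtractAllPermutations` is not shown on the slides (it is called a “difficult function!”). One possible recursive implementation, my own sketch rather than the course’s code, builds all permutations of the indices and then indexes into the data:

```r
# Sketch of ExtractAllPermutations: all length(x)! permutations of x,
# one permutation per column. (Assumed behavior, not the course's code.)
AllPermIdx <- function(n) {          # all permutations of 1:n, one per column
  if (n == 1) return(matrix(1L, 1, 1))
  sub <- AllPermIdx(n - 1)
  blocks <- lapply(1:n, function(i) {
    rest <- (1:n)[-i]                # the elements other than i
    rbind(i, matrix(rest[sub], nrow = n - 1))
  })
  do.call(cbind, blocks)
}
ExtractAllPermutations <- function(x) {
  idx <- AllPermIdx(length(x))
  matrix(x[idx], nrow = length(x))
}
dim(ExtractAllPermutations(rep(c("Milk", "Tea"), each = 4)))  # [1] 8 40320
```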
The full / complete permutation sampling distribution of \(T\)
X_bs <- ExtractAllPermutations(X)  # difficult function!
B <- ncol(X_bs)
T_bs <- numeric(B)
for (b in 1:B) {
T_bs[b] <- CountSuccesses(X_bs[, b], Truth)
}
table(T_bs) # compare with 70 * table(T_bs) / B
## T_bs
##     0     1     2     3     4 
##   576  9216 20736  9216   576
The “default” permutation sampling distribution of \(T\)
B <- 1000; N <- length(X)
T_bs <- numeric(B)
for (b in 1:B) {
T_bs[b] <- CountSuccesses(sample(X), Truth)
}
table(T_bs) # compare with 70 * table(T_bs) / B
## T_bs
##   0   1   2   3   4 
##   9 235 507 235  14
We need…
Sir Ronald Fisher’s experiment on what he poured first:
set.seed(1234)
Truth <- sample(rep(c("Milk", "Tea"), each = 4))
Suppose Lady Muriel Bristol guessed all of them correctly; then our data is
X <- Truth
\(T(\mathbf{X})\): The number of correct “Milk first” cups
CountSuccesses <- function(X, truth) {
sum("Milk" == X & "Milk" == truth) # "Milk" is recycled to the length of X
}
t_obs <- CountSuccesses(X = X, truth = Truth)
B <- 1000; N <- length(X)
T_bs <- numeric(B)
for (b in 1:B) {
T_bs[b] <- CountSuccesses(sample(X), Truth)
}
mean(T_bs >= t_obs)
## [1] 0.014
# Note that it is an estimate of 1/70
Can we test Dr. Muriel Bristol’s tasting ability with far more cups?
E.g. the number of permutations for 800 successfully assigned “Milk first” cups (and 200 for “Tea first”) is
choose(1000, 800)*choose(1000, 200)
## [1] Inf
A permutation test is a fine way to go!
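Note that `choose()` returns `Inf` only because the product overflows double precision; on the log scale the count is perfectly finite. A quick check with base R’s `lchoose` (not on the slides):

```r
# The same count on the log scale: lchoose avoids overflow
log_count <- lchoose(1000, 800) + lchoose(1000, 200)
log_count / log(10)   # order of magnitude in base 10 (hundreds of digits)
```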
Under the \(H_0\) for any permutation of the data, the sampling distributions of \(T(\mathbf{X})\) and \(T(\pi(\mathbf{X}))\) are the same, i.e. \(T(\mathbf{X})\) should be random among all \(T(\pi(\mathbf{X}))\).
Hypothesis: \(H_0\)
Data: \(\mathbf{X} = \{ X_1, \ldots, X_n \}\)
Test statistic: \(T(\mathbf{X})\)
Permuted data: \(\pi(\mathbf{X}) = \{ X_{\pi(1)}, \ldots, X_{\pi(n)} \}\)
DATA / HYPOTHESIS / TEST
EXPLOITING EXCHANGEABILITY
Let \(\mathbf{Z} = \{Z_1,...,Z_N\} = \{X_1,...,X_m,Y_1,...,Y_n\}\), the pooled sample, where \(N = m + n\).
Then, under the \(H_0: F = G\), the samples in \(\mathbf{Z}\) are i.i.d. and so are those in every permutation \(Z_{\pi (1)}, . . . , Z_{\pi (N)}\).
DATA \(X_1,...,X_m\) and \(Y_1,...,Y_n\) independent random samples from \(F\) and \(G\).
TEST STATISTIC \(T = T(X_1,...,X_m,Y_1,...,Y_n)\)
PERMUTATION TEST
For every partition \(b\) of the observed values \(z_1, . . . , z_N\) into 2 sets of sizes \(m\) and \(n\), compute \(T^b\) with the \(X\)s and \(Y\)s taken equal to the two sets.
For \(t_{obs} = T(x_1, \ldots, x_m, y_1, \ldots, y_n)\) the observed value, compute
\[ \widehat{p} = \frac{1}{B}\sum^B_{b = 1}1_{T^b \geq t_{obs}} \]
where \(1_{T^b \geq t_{obs}} = 1\) if \(T^b \geq t_{obs}\), and \(0\) otherwise.
DATA \(X_1,...,X_m\) and \(Y_1,...,Y_n\) independent random samples from \(F\) and \(G\).
TEST STATISTIC \(T = T(X_1,...,X_m,Y_1,...,Y_n)\)
PERMUTATION TEST
For every partition b of the observed values \(z_1, . . . , z_N\) into 2 sets of sizes \(m\) and \(n\) compute \(T^b\) with the \(X\)’s and \(Y\)’s taken equal to the two sets.
IN PRACTICE
The number of partitions \({N \choose m}\) is too large and one takes a large number of random partitions instead.
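As a sketch of this random-partition version (with made-up normal data, not from the slides), using the difference in means as the test statistic:

```r
# Two-sample permutation test on the difference in means (illustrative data)
set.seed(1)
x <- rnorm(15, mean = 1); y <- rnorm(20)
z <- c(x, y); m <- length(x)          # pooled sample of size N = m + n
t_obs <- mean(x) - mean(y)
B <- 2000
T_bs <- replicate(B, {
  idx <- sample(length(z), m)         # random partition into sizes m and n
  mean(z[idx]) - mean(z[-idx])
})
p_hat <- mean(T_bs >= t_obs)          # the estimated p-value
p_hat
```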
DATA / HYPOTHESIS / TEST
- \((X_1, Y_1), ..., (X_n, Y_n)\) random sample from unknown bivariate distribution.
- \(H_0: F_{X_i} = G_{Y_i}\), within each pair \(X\) and \(Y\) have the same (marginal) distribution.
- Reject \(H_0\) if \(T = T((X_1, Y_1), ..., (X_n, Y_n))\) is extreme
EXCHANGEABILITY
Under \(H_0\) each \(X\) is exchangeable with its \(Y\). We can permute within each pair, thus we can create \(2^n\) permuted data sets: \(\pi(X_1, Y_1), ..., \pi(X_n, Y_n)\)
DATA \((X_1, Y_1), ..., (X_n, Y_n)\) random sample from bivariate distribution.
TEST STATISTIC \(T = T((X_1, Y_1), ..., (X_n, Y_n))\)
PERMUTATION TEST
For each \(b\) of the \(B = 2^n\) possible permuted data sets, where \(X\)- and \(Y\)-labels may have been swapped with each other, compute \(T^b = T(\pi(X_1, Y_1), \ldots, \pi(X_n, Y_n))\).
For the observed \(t_{obs} = T((x_1, y_1), \ldots, (x_n, y_n))\), compute
\[\widehat{p} = \frac{1}{B} \sum^B_{b = 1} 1_{T^b \geq t_{obs}} \]
PERMUTATION TEST
For each \(b\) of the \(B = 2^n\) possible permuted data sets, where \(X\)- and \(Y\)-labels may have been swapped with each other, compute \(T^b = T(\pi(X_1, Y_1), \ldots, \pi(X_n, Y_n))\).
For the observed \(t_{obs} = T((x_1, y_1), \ldots, (x_n, y_n))\), compute
\[\widehat{p} = \frac{1}{B} \sum^B_{b = 1} 1_{T^b \geq t_{obs}} \]
IN PRACTICE the number of possible reassignments is too large and one takes a large number of random reassignments instead.
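A sketch with made-up paired data (data and variable names are mine, not from the slides). Swapping \(X\) and \(Y\) within a pair simply flips the sign of that pair’s difference, so random within-pair swaps reduce to random sign flips of the differences:

```r
# Paired permutation test: swap X- and Y-labels within pairs at random
set.seed(2)
n <- 30
X <- rnorm(n, mean = 0.5); Y <- rnorm(n)   # illustrative paired observations
t_obs <- mean(X - Y)
B <- 2000
T_bs <- replicate(B, {
  swap <- sample(c(-1, 1), n, replace = TRUE)  # -1: swap X and Y in that pair
  mean(swap * (X - Y))
})
p_hat <- mean(T_bs >= t_obs)
p_hat
```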
Random sample \(X_1, ... , X_n \overset{\text{iid}}{\sim} F\), where \(F\) is unknown but symmetric about \(\theta\).
Suppose the distribution \(F\) is symmetric about \(\theta\); then for random signs \(\{S_1, \ldots, S_n\}\), the samples \(\{X_1, \ldots, X_n\}\) and \(\{S_1(X_1 - \theta) + \theta, \ldots, S_n(X_n - \theta) + \theta\}\) have the same distribution.
X <- rnorm(1e6, mean = 3)
mean(X)
## [1] 2.998974
exchangers <- sample(c(-1, 1), length(X), replace = TRUE)
mean(exchangers * (X - 3) + 3)
## [1] 2.999888
Random sample \(X_1, ... , X_n \overset{\text{iid}}{\sim} F\), where \(F\) is unknown but symmetric about \(\theta\).
Suppose \(H_0: \theta = 0.4\), and we have the following sample \(\mathbf{X}\), where we don’t know from which symmetric distribution the data was generated:
DATA EXAMPLE:
set.seed(191107); n <- 1e2
X <- rnorm(n, mean = 0.25)  # pretend we don't know how X was generated
t_obs <- mean(X)
t_obs
## [1] 0.2201182
“PERMUTATION” TEST
For each of the \(2^n\) possible sign vectors, compute \(T^b = T(S_1X_1, \ldots, S_nX_n)\).
In practice the number of sign vectors is too large and one takes a large number of random sign vectors instead.
The Permutation test:
For each \(b\) out of \(B\) samples from the \(2^n\) possible sign vectors, compute \(T^b = T(S_1X_1, \ldots, S_nX_n)\).
For \(t_{obs} = T(X_1, ... , X_n)\) the observed value, compute \[ \widehat{p} = \frac{1}{B} \sum^B_{b = 1} 1_{T^b \geq t_{obs}}\]
theta <- 0.4; t_obs <- abs(mean(X) - theta)
B <- 1e3; t_prms <- numeric(B)
for (b in 1:B) {
S <- sample(c(-1, 1), length(X), replace = TRUE) # random signs
t_prms[b] <- abs(mean(S * (X - theta)))
}
pval <- mean(t_prms >= t_obs)
pval
## [1] 0.07
Even if you have found a good exchangeability principle, the whole framework breaks down if you have too little data (\(N\) is too small) or use too few permuted data sets (\(B\) is too small).
Often a permutation \(p\)-value based on random permutations is simply seen as an estimate of the \(p\)-value based on all permutations.